This is Zachary Sherker’s hw005 for STAT 545.
In this assignment, I will:
1) Reorder a factor based on the data and demonstrate the effect in arranged data and in figures.
2) Write data to file and load it back into R.
3) Improve a figure through the use of factor levels, smoother mechanics, and color schemes.
4) Convert this figure to a plotly visual.

## Part I: Factor management

First, I will load the required packages and data sets:
suppressPackageStartupMessages(library(tidyverse))
## Warning: package 'tidyverse' was built under R version 3.4.2
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.3
suppressPackageStartupMessages(library(gapminder))
## Warning: package 'gapminder' was built under R version 3.4.2
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(scales))
## Warning: package 'scales' was built under R version 3.4.4
I start by filtering the Oceania observations out of the gapminder data set:

Oceania_drop <- gapminder %>%
  filter(continent != "Oceania")
## Warning: package 'bindrcpp' was built under R version 3.4.4
str(Oceania_drop)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
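I will then drop the unused Oceania level in two ways. fct_drop() removes it from continent only, whereas droplevels() removes unused levels from every factor, which is why country falls from 142 to 140 levels in the second result: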
First_drop <- Oceania_drop %>%
mutate(continent=fct_drop(continent))
str(First_drop)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Second_drop <- Oceania_drop %>%
droplevels()
str(Second_drop)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
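The difference between the two approaches is easier to see on a small example. The sketch below uses a made-up data frame (`toy`, `region`, and `city` are illustrative names, not part of gapminder) and assumes forcats is installed:

```r
library(forcats)

# Hypothetical toy data with two factor columns
toy <- data.frame(
  region = factor(c("North", "South", "East")),
  city   = factor(c("A", "B", "C"))
)

# Filtering rows removes the data but leaves the factor levels in place
kept <- toy[toy$region != "East", ]
nlevels(kept$region)  # 3
nlevels(kept$city)    # 3

# fct_drop() cleans up only the factor it is given
one_col <- kept
one_col$region <- fct_drop(one_col$region)
nlevels(one_col$region)  # 2
nlevels(one_col$city)    # 3, untouched

# droplevels() on a data frame cleans up every factor column
all_cols <- droplevels(kept)
nlevels(all_cols$region)  # 2
nlevels(all_cols$city)    # 2
```

Which one to prefer depends on whether the other factor columns should keep their full set of levels.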
The continents in the gapminder dataset are currently ordered alphabetically:

levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
I will first reorder the continents by mean GDP per capita (the mean is shown in the plot):
gapminder %>%
mutate(continent=fct_reorder(continent,gdpPercap,mean)) %>%
ggplot(aes(continent,gdpPercap)) + geom_violin(aes(fill=continent))+
stat_summary( fun.y=mean, colour="green", geom="point", size=2,show.legend = TRUE ) +
stat_summary( fun.y=mean, colour="purple", geom="text", size = 4, show.legend = TRUE,
vjust=-0.7, aes( label=round( ..y.., digits=1 ) ) )
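A note for anyone running this on a newer ggplot2: since version 3.3.0 the `fun.y` argument of `stat_summary()` is deprecated in favour of `fun`, and `after_stat(y)` replaces the `..y..` notation. A minimal sketch of the same two summary layers in the newer syntax, using the built-in `mtcars` data as a stand-in for gapminder:

```r
library(ggplot2)

# Stand-in data: mtcars ships with R; cyl plays the role of continent
p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_violin(aes(fill = factor(cyl))) +
  # point at the group mean
  stat_summary(fun = mean, geom = "point", colour = "green", size = 2) +
  # text label with the rounded group mean, nudged above the point
  stat_summary(fun = mean, geom = "text", colour = "purple", size = 4,
               vjust = -0.7,
               aes(label = round(after_stat(y), digits = 1)))
p
```

The plot should look the same as with the older arguments; only the deprecation warnings go away.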
I will now reorder the continents by maximum GDP per capita (the maximum is shown in the plot):
gapminder %>%
mutate(continent=fct_reorder(continent,gdpPercap,max)) %>%
ggplot(aes(continent,gdpPercap)) + geom_violin(aes(fill=continent))+
stat_summary( fun.y=max, colour="green", geom="point", size=2,show.legend = TRUE ) +
stat_summary( fun.y=max, colour="purple", geom="text", size = 4, show.legend = TRUE,
vjust=-0.7, aes( label=round( ..y.., digits=1 ) ) )
Finally, I will reorder the continents by minimum GDP per capita (the minimum is shown in the plot):
gapminder %>%
mutate(continent=fct_reorder(continent,gdpPercap,min)) %>%
ggplot(aes(continent,gdpPercap)) + geom_violin(aes(fill=continent))+
stat_summary( fun.y=min, colour="green", geom="point", size=2,show.legend = TRUE ) +
stat_summary( fun.y=min, colour="purple", geom="text", size = 4, show.legend = TRUE,
vjust=-0.7, aes( label=round( ..y.., digits=1 ) ) )
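The plots show the effect of reordering on the axis, but the same change can be checked directly on the levels and on arranged data. A small sketch with made-up values (`toy`, `group`, and `value` are illustrative names), assuming forcats and dplyr are installed:

```r
library(forcats)
library(dplyr)

# Hypothetical toy data: three groups with two values each
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(5, 7, 1, 3, 10, 0)
)

levels(toy$group)                                 # "a" "b" "c" (alphabetical)
levels(fct_reorder(toy$group, toy$value, mean))   # "b" "c" "a"
levels(fct_reorder(toy$group, toy$value, max))    # "b" "a" "c"

# arrange() sorts a factor by its level order, not alphabetically
by_max <- toy %>%
  mutate(group = fct_reorder(group, value, max)) %>%
  arrange(group)
as.character(by_max$group)                        # "b" "b" "a" "a" "c" "c"
```

The summary function passed to `fct_reorder()` decides the order, so the same factor can end up with different level orders for the mean and the maximum.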
## Part II: File I/O

I start by filtering the data to only show information from the Americas in 2002:
filterdata <- gapminder %>%
filter(continent == "Americas" & year == 2002)
# drop unused levels
AmericasData <- filterdata %>%
droplevels()
# check the levels of continent and country to be sure unused data is dropped.
str(AmericasData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 25 obs. of 6 variables:
## $ country : Factor w/ 25 levels "Argentina","Bolivia",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 1 level "Americas": 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
## $ lifeExp : num 74.3 63.9 71 79.8 77.9 ...
## $ pop : int 38331121 8445134 179914212 31902268 15497046 41008227 3834934 11226999 8650322 12921234 ...
## $ gdpPercap: num 8798 3413 8131 33329 10779 ...
I will now write the new dataset out to a CSV file:
write_csv(AmericasData,"AmericasData.csv")
And read it back in from the CSV file:
read_AmeicasData<- read_csv("AmericasData.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
I will now check the newly read-in dataset. Note that country and continent come back as character columns rather than factors:
head(read_AmeicasData)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Argentina Americas 2002 74.3 38331121 8798.
## 2 Bolivia Americas 2002 63.9 8445134 3413.
## 3 Brazil Americas 2002 71.0 179914212 8131.
## 4 Canada Americas 2002 79.8 31902268 33329.
## 5 Chile Americas 2002 77.9 15497046 10779.
## 6 Colombia Americas 2002 71.7 41008227 5755.
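If the factor columns are needed after a CSV round trip, one option is to tell readr how to parse them. A minimal sketch, assuming readr is installed; the three rows are taken from the output above and written to a temporary file so that nothing in the working directory is touched:

```r
library(readr)

# Three rows copied from the Americas 2002 output above
toy <- data.frame(
  country = factor(c("Argentina", "Canada", "Chile")),
  lifeExp = c(74.3, 79.8, 77.9)
)

path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Default guessing: country comes back as character
plain <- read_csv(path, col_types = cols())
class(plain$country)     # "character"

# col_factor(levels = NULL) infers the levels from the data,
# in order of first appearance
restored <- read_csv(
  path,
  col_types = cols(country = col_factor(levels = NULL))
)
class(restored$country)  # "factor"
```

Any custom level order still has to be re-applied after reading, because a CSV file stores only the values.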
# save to RDS file
saveRDS(AmericasData, "AmericasData.rds")
# read from RDS file
read_RDSdata <- readRDS("AmericasData.rds")
# check readin data
head(read_RDSdata)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Argentina Americas 2002 74.3 38331121 8798.
## 2 Bolivia Americas 2002 63.9 8445134 3413.
## 3 Brazil Americas 2002 71.0 179914212 8131.
## 4 Canada Americas 2002 79.8 31902268 33329.
## 5 Chile Americas 2002 77.9 15497046 10779.
## 6 Colombia Americas 2002 71.7 41008227 5755.
# put data into text file
dput(AmericasData, "AmericasData.txt")
# retrieve data from text file
data_txt <- dget("AmericasData.txt")
head(data_txt)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Argentina Americas 2002 74.3 38331121 8798.
## 2 Bolivia Americas 2002 63.9 8445134 3413.
## 3 Brazil Americas 2002 71.0 179914212 8131.
## 4 Canada Americas 2002 79.8 31902268 33329.
## 5 Chile Americas 2002 77.9 15497046 10779.
## 6 Colombia Americas 2002 71.7 41008227 5755.
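Unlike the CSV round trip, both `saveRDS()`/`readRDS()` and `dput()`/`dget()` keep the factor columns, as the `<fct>` headers above show. They also keep a non-alphabetical level order. A small base R sketch with a made-up factor (`toy` and `size` are illustrative names), written to temporary files:

```r
# Hypothetical factor with a deliberately non-alphabetical level order
toy <- data.frame(
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
)

# Round trip through an RDS file
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
from_rds <- readRDS(rds_path)

# Round trip through a dput() text file
txt_path <- tempfile(fileext = ".txt")
dput(toy, txt_path)
from_txt <- dget(txt_path)

levels(from_rds$size)  # "small" "medium" "large"
levels(from_txt$size)  # "small" "medium" "large"
```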
## Part III: Improving a figure

I will start by creating a basic graph comparing the GDP per capita of all countries within continental groupings:
ggplot(gapminder,aes(gdpPercap,continent))+
geom_line(aes(colour=continent,size=gdpPercap),alpha=0.8)
I will now modify the graph to make it more informative by first reorganizing the data:
# get the min, max, and mean GDP per capita for each continent in each year
Reorganized_data <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    min_gdp = min(gdpPercap),
    max_gdp = max(gdpPercap),
    mean_gdp = mean(gdpPercap)
  )
Reorganized_table <- gather(Reorganized_data,key = "Type_GDP", value="Value_GDP", min_gdp, max_gdp,mean_gdp)
# then check the new gathered table
knitr::kable(head(Reorganized_table))
| continent | year | Type_GDP | Value_GDP |
|---|---|---|---|
| Africa | 1952 | min_gdp | 298.8462 |
| Africa | 1957 | min_gdp | 335.9971 |
| Africa | 1962 | min_gdp | 355.2032 |
| Africa | 1967 | min_gdp | 412.9775 |
| Africa | 1972 | min_gdp | 464.0995 |
| Africa | 1977 | min_gdp | 502.3197 |
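For readers on tidyr 1.0.0 or later, `pivot_longer()` is the recommended replacement for `gather()`. A minimal sketch of the equivalent reshaping on made-up numbers (`wide` and its values are hypothetical; only the column names mirror the ones used here):

```r
library(tidyr)

# Hypothetical wide summary table with one column per statistic
wide <- data.frame(
  continent = c("X", "Y"),
  min_gdp   = c(1, 2),
  max_gdp   = c(10, 20)
)

# One row per continent and statistic
long <- pivot_longer(
  wide,
  cols      = c(min_gdp, max_gdp),
  names_to  = "Type_GDP",
  values_to = "Value_GDP"
)
long
```

One difference to be aware of: `gather()` stacks the result column by column, while `pivot_longer()` works through the input row by row, so the row order of the two results differs even though the content is the same.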

Now I will plot this data:
Reorganized_graph <- Reorganized_table %>%
  ggplot(aes(x = year, y = Value_GDP, color = Type_GDP)) +
  facet_wrap(~continent) +
  scale_y_log10(labels = dollar_format()) +
  scale_x_continuous() +
  geom_point() +
  geom_line() +
  labs(x = "Year",
       y = "GDP per capita",
       title = "Minimum, maximum, and mean GDP per capita by continent and year") +
  theme(axis.text = element_text(size = 10),
        strip.background = element_rect(fill = "green"),
        panel.background = element_rect(fill = "white"))
Reorganized_graph
As you can see, this reorganized data makes for a much more informative graph, allowing us to observe larger trends in a simple format.

### (2) Convert graph to plotly
ggplotly(Reorganized_graph)
The plotly version of this graph lets us read off the underlying values by hovering the mouse over the data points, making the graph more informative still.

## Part IV: Writing figures to file
ggsave("modified_graph.png", plot = Reorganized_graph, width=16, height=6, units = "cm")
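`ggsave()` saves `last_plot()`, the most recently created or modified ggplot, unless a `plot` argument is given. A minimal sketch of the explicit form, using the built-in `mtcars` data as a stand-in and a temporary directory so that no file is left in the project (`p`, `q`, and `out_file` are illustrative names):

```r
library(ggplot2)

# Two stand-in plots; q is created last, so it becomes last_plot()
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
q <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

out_file <- file.path(tempdir(), "example_plot.png")

# Naming the plot makes the result independent of what was drawn last
ggsave(out_file, plot = p, width = 16, height = 6, units = "cm")

file.exists(out_file)  # TRUE
```

Naming the plot is the safer habit in a report that draws several figures before saving one of them.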